IMPROVED APPARATUS & METHOD FOR MODULAR MULTIPLICATION &
EXPONENTIATION BASED ON MONTGOMERY MULTIPLICATION 

FIELD OF THE INVENTION 
The present invention relates to apparatus and methods for modular multiplication 
and exponentiation and for serial integer division. 

BACKGROUND OF THE INVENTION 

A compact microelectronic device for performing modular multiplication and 
exponentiation over large numbers is described in Applicant's U.S. Patent 5,513,133, the 
disclosure of which is hereby incorporated by reference. 

The disclosures of all publications mentioned in the specification and of the 
publications cited therein are hereby incorporated by reference. 

SUMMARY OF THE INVENTION 

The present invention seeks to provide improved apparatus and methods for 
modular multiplication and exponentiation and for serial integer division. 

There is thus provided, in accordance with a preferred embodiment of the present 
invention, a modular multiplication and exponentiation system including a serial-parallel 
arithmetic logic unit (ALU) including a single multiplier including a single carry-save 
adder and preferably including a serial division device operative to receive a dividend of 
any bit length and a divisor of any bit length and to compute a quotient and a remainder. 

Further in accordance with a preferred embodiment of the present invention, the 
system is operative to multiply at least one pair of integer inputs of any bit length. 

Still further in accordance with a preferred embodiment of the present invention, 
the at least one pair of integer inputs includes two pairs of integer inputs. 

Additionally in accordance with a preferred embodiment of the present invention, 
the ALU is operative to generate a product of integer inputs and to reduce the size of the 
product without previously computing a zero-forcing Montgomery constant, J0.

Also provided, in accordance with another preferred embodiment of the present
invention, is serial integer division apparatus including a serial division device operative 
to receive a dividend of any bit length and a divisor of any bit length and to compute a 
quotient and a remainder. 
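
As a rough illustration of what a bit-serial divider computes, the following Python sketch performs generic shift-and-subtract (restoring) division, consuming one dividend bit per step. It is not taken from the apparatus of Fig. 2; the function name `serial_divide` and the most-significant-bit-first ordering are assumptions made for this example only.

```python
def serial_divide(dividend, divisor):
    """Generic restoring division, one dividend bit per step (illustrative only)."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient = 0
    remainder = 0
    # Feed the dividend in serially, most significant bit first.
    for bit_index in range(dividend.bit_length() - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> bit_index) & 1)
        quotient <<= 1
        if remainder >= divisor:   # the trial subtraction succeeds
            remainder -= divisor
            quotient |= 1          # emit a quotient bit of one
    return quotient, remainder
```

Because Python integers are unbounded, the sketch accepts a dividend and a divisor of any bit length, which mirrors the property stated above.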

Further in accordance with a preferred embodiment of the present invention, the 
apparatus includes a pair of registers for storing a pair of integer inputs and which is 
operative to multiply a pair of integer inputs, at least one of which exceeds the bit length 
of its respective register, without interleaving. 

Also provided, in accordance with yet another preferred embodiment of the present 
invention, is a modular multiplication and exponentiation system including a serial- 
parallel multiplying device having only one carry-save accumulator and being operative 
to perform a pair of multiplications and to sum results thereof.

Additionally provided, in accordance with still another preferred embodiment of 
the present invention, is a modular multiplication and exponentiation method including 
providing a serial-parallel arithmetic logic unit (ALU) including a single modular 
multiplying device having a single carry-save adder, and employing the serial-parallel 
ALU to perform modular multiplication and exponentiation. 

Further provided, in accordance with yet another preferred embodiment of the 
present invention, is a method for natural (not modular) multiplication of large integers, 
the method including providing a serial-parallel arithmetic logic unit (ALU) including a 
single modular multiplying device having a single carry-save adder, and employing the 
serial-parallel ALU to perform natural (not modular) multiplication of large integers. 

Further in accordance with a preferred embodiment of the present invention, the 
employing step includes multiplying a first integer of any bit length by a second integer of 
any bit length to obtain a first product, multiplying a third integer of any bit length by a 
fourth integer of any bit length to obtain a second product, and summing the first and 
second products with a fifth integer of any bit length to obtain a sum. 

Still further in accordance with a preferred embodiment of the present invention, 
the employing step includes performing modular multiplication and exponentiation with a 
multiplicand, multiplier and modulus of any bit length.

Additionally in accordance with a preferred embodiment of the present invention,
the system also includes a double multiplicand precomputing system for executing 
Montgomery modular multiplication with only one precomputed constant. 

Further in accordance with yet another preferred embodiment of the present
invention, the employing step includes performing Montgomery multiplication
including generating a product of integer inputs including a multiplier and a multiplicand,
and executing modular reduction without previously computing a Montgomery
constant, J0.

Further in accordance with a preferred embodiment of the present invention, the
Montgomery constant J0 includes a function of N mod 2^k, where N is a modulus of the
modular reduction and k is the bit-length of the multiplicand.

Still further in accordance with a preferred embodiment of the present invention, 
the employing step includes performing a sequence of interleaved Montgomery 
multiplication operations. 

Additionally in accordance with a preferred embodiment of the present invention, 
each of the interleaved Montgomery multiplication operations is performed without 
previously computing the number of times the modulus must be summated into a 
congruence of the multiplication operation in order to force a result with at least k 
significant zeros. 

Still further in accordance with a preferred embodiment of the present invention, 
the system also includes a data preprocessor operative to collect and serially summate 
multiplicands generated in an i'th interleaved Montgomery multiplication operation 
thereby to generate a sum and to feed in the sum to an (i+1)'th Montgomery
multiplication operation. 

Additionally in accordance with a preferred embodiment of the present invention,
the function includes an additive inverse of a multiplicative inverse of N mod 2^k.
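
The relationship just stated can be written out in a few lines of Python. This is a minimal sketch of the arithmetic definition only, not of how the device obtains the value; the function name `montgomery_j0` is hypothetical, and the code assumes an odd modulus so that the inverse exists.

```python
def montgomery_j0(n, k):
    """Return J0 = -(N0^-1) mod 2^k, where N0 = N mod 2^k and N is odd."""
    if n % 2 == 0:
        raise ValueError("the modulus must be odd for the inverse to exist")
    radix = 1 << k
    n0 = n % radix                 # the LS k bits of N
    inverse = pow(n0, -1, radix)   # multiplicative inverse (Python 3.8+)
    return (-inverse) % radix      # additive inverse of that inverse
```

By construction J0·N0 + 1 is a multiple of 2^k, which is what makes J0 "zero-forcing" on the k least significant bits.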

Further in accordance with a preferred embodiment of the present invention, the
method also comprises computing J0 by resetting Ai and B to zero and setting S0 = 1.

The present invention also relates to a compact microelectronic arithmetic logic
unit, ALU, for performing modular and normal (natural, non-negative field of integers) 
multiplication, division, addition, subtraction and exponentiation over very large 
integers. When referring to modular multiplication and squaring using Montgomery 
methods, reference is made to the specific parts of the device as a modular arithmetic 
coprocessor, and the acronym, MAP, is used. Reference is also made to the Montgomery 
multiplication methods as MM. 

The present invention also relates to arithmetic processing of large integers. These 
large numbers can be in the natural field of (non-negative) integers or in the Galois field 
of prime numbers, GF(p), and also of composite prime moduli. More specifically, the 
invention relates to a device that can implement modular multiplications/exponentiations 
of large numbers, which is suitable for performing the operations essential to Public Key 
Cryptographic authentication and encryption protocols, which work over increasingly 
large operands and which cannot be executed efficiently with present generation modular 
arithmetic coprocessors, and cannot be executed securely with software implementations.
The invention can be driven by any 4 bit or longer processor, achieving speeds which can 
surpass present day digital signal processors. 

The present invention also relates to the hardware implementation of large operand 
integer arithmetic, especially as concerns the numerical manipulations in a derivative of a 
procedure known as the interleaved Montgomery multiprecision modular multiplication 
method often used in encryption software oriented systems, but also of intrinsic value in 
basic arithmetic operations on long operand integers; in particular, A·B + C·D + S, wherein
there is no theoretical limit on the sizes of A, B, C, D, or S. In addition, the device is 
especially attuned to perform modular multiplication and exponentiation. The basic 
device is particularly suited to be a modular arithmetic co-processor (MAP), also 
including a device for performing division of very large integers, wherein the divisor can 
have a bit length as long as the modulus register N and the bit length of the dividend can 
be as large as the bit length of two concatenated registers. 

This device preferably performs all of the functions of US Patent 5,513,133, with 
the same order of logic gates, in less than half the number of machine clock cycles. This 
is mostly because there is only one double action serial/parallel multiplier instead of two
half size multipliers using the same carry save accumulator mechanism, the main
component of a conventional serial parallel multiplier. The new arithmetic logic unit,
ALU, or specifically the modular arithmetic coprocessor, MAP, preferably intrinsically
obviates a separate multiplication process which would have preceded the new process.
This process would also have required a second Montgomery constant, J0, which is now
also preferably obviated. Stated differently, instead of the two constants in the previous
Montgomery procedures, and the delays encountered, only one constant is now 
computed, and the delay caused by the now superfluous J type multiplications (explained 
later) is preferably removed. 

Further, by better control of the data manipulations between the CPU and this
peripheral device, operations which are performed on operands longer than the natural
register size of the device can preferably be performed at reduced processing times
using less temporary storage memory.

Three related methods are known for performing modular multiplication with
Montgomery's methodology: [P. L. Montgomery, "Modular multiplication without trial
division", Mathematics of Computation, vol. 44, pp. 519-521, 1985], hereinafter referred
to as "Montgomery"; [S. R. Dusse and B. S. Kaliski Jr., "A cryptographic library for the
Motorola DSP 56000", Proc. Eurocrypt '90, Springer-Verlag, Berlin, 1990], hereinafter
referred to as "Dusse"; and the method of US Patent 4,514,592 to Miyaguchi, the
method of US Patent 5,101,431 to Even, the method of US Patent 5,321,752 to
Iwamura, the method of US Patent 5,448,639 to Arazi, and the method of US
Patent 5,513,133 to Gressel.

The preferred architecture is of a machine that can be integrated into any
microcontroller design, mapped into the host controller's memory, while working in
parallel with the controller, which, for very long commands, constantly swaps or feeds
operands to and from the data feeding mechanism, allowing for modular arithmetic
computations of any popular length where the size of the coprocessor volatile memory
necessary for manipulations should rarely be more than three times the length of the
largest operand.

This solution preferably uses only one multiplying device which inherently serves 
the function of two multiplying devices, in previous implementations. Using present
popular technologies, it enables the integration of the complete solution including a
microcontroller with memories onto a 4 by 4.5 by 0.2 mm microelectronic circuit. 

The invention is also directed to the architecture of a digital device which is 
intended to be a peripheral to a conventional digital processor, with computational, 
logical and architectural novel features relative to the processes published by 
Montgomery and Dusse, as described in detail below. 

A concurrent process and a unique hardware architecture are provided, to perform 
modular exponentiation without division preferably with the same number of operations 
as would be performed with a classic multiplication/division device, wherein a classic 
device would perform both a multiplication and a division on each operation. A 
particular feature of a preferred embodiment of the present invention is the concurrency 
of operations performed by the device to allow for unlimited operand lengths, with 
uninterrupted efficient use of resources, allowing for the basic large operand integer 
arithmetic functions. 

The advantages realized by a preferred embodiment of this invention result from a 
synchronized sequence of serial processes, which are merged to simultaneously (in 
parallel) achieve three multiplication operations on n bit operands, using one multiplexed 
k bit serial/parallel multiplier in (n + k) effective clock cycles, accomplishing the 
equivalent of three multiplication computations, as prescribed by Montgomery. 
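
For orientation, the three multiplications prescribed by Montgomery can be sketched in software as follows. This is a plain full-length formulation under stated assumptions (odd modulus n, operands smaller than n, n below 2^n_bits); the function name and the locally computed constant `j` are illustrative and say nothing about how the hardware merges the three products into one pass.

```python
def montgomery_three_mults(a, b, n, n_bits):
    """Return a*b*2^(-n_bits) mod n using three explicit multiplications."""
    r = 1 << n_bits
    j = (-pow(n, -1, r)) % r       # zero-forcing constant for the full length
    x = a * b                      # multiplication 1: the raw product
    y = ((x % r) * j) % r          # multiplication 2: how many times to add n
    z = (x + y * n) >> n_bits      # multiplication 3; the LS n_bits are zero
    return z - n if z >= n else z  # at most one final subtraction is needed
```

The right shift is exact because x + y·n is a multiple of 2^n_bits, so no trial division by n is ever performed.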

By synchronizing, detecting on the fly, preloading on the fly and simultaneously
adding next-to-be-used operands, the machine operates in a
deterministic fashion, wherein all multiplications and exponentiations are executed in a
predetermined number of clock cycles. Conditional branches are replaced with local
detection and compensation devices, thereby providing a basis for a simple type of control
mechanism, which, when refined, typically includes a series of self-exciting cascaded
counters. The basic operations herein described can be executed in deterministic time
using the device described in US Patent 5,513,133 as manufactured both by Motorola in 
East Kilbride, Scotland under the trade name SC-49, and by SGS-Thomson in Rousset, 
France, under the trade name ST16-CF54. 

The machine has particularly lean demands on volatile memory for most 
operations, as operands are loaded into and stored in the device for the total length of the
operation; however, the machine preferably exploits the CPU onto which it is appended,
to execute simple loads and unloads, and sequencing of commands to the machine, whilst 
the machine performs its large number computations. The exponentiation processing 
time is virtually independent of the CPU which controls it. In practice, no architectural 
changes are necessary when appending the machine to any CPU. The hardware device is 
self-contained, and can be appended to any CPU bus. 

Apparatus for accelerating the modular multiplication and exponentiation process 
is preferably provided, including means for precomputing the necessary constants. 

The preferred embodiments of the invention described herein provide a modular 
mathematical operator for public key cryptographic applications on portable Smart 
Cards, typically identical in shape and size to the popular magnetic stripe credit and bank 
cards. Similar Smart Cards (as per US Patent 5,513,133) are being used in the new 
generation of public key cryptographic devices for controlling access to computers, 
databases, and critical installations; to regulate and secure data flow in commercial, 
military and domestic transactions; to decrypt scrambled pay television programs, etc. It 
should be appreciated that these devices are also incorporated in computer and fax 
terminals, door locks, vending machines, etc. 

The hardware described carries out modular multiplication and exponentiation by 
applying the 𝒫 operator in a novel way. Further, the squaring can be carried out in the
same method, by applying it to a multiplicand and a multiplier that are equal. Modular 
exponentiation involves a succession of modular multiplications and squarings, and 
therefore is carried out by a method which comprises the repeated, suitably combined 
and oriented application of the aforesaid multiplication, squaring and exponentiation 
methods. 
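
The succession of squarings and multiplications can be sketched in Python as below. This is a conventional left-to-right square-and-multiply loop built on a software Montgomery product, offered only as an illustration: the names `mont_mul` and `mont_exp` are hypothetical, the constant `j` is recomputed in software for brevity (the device described here avoids it), and `h` = 2^(2·n_bits) mod n is assumed to be the single precomputed constant.

```python
def mont_mul(a, b, n, n_bits):
    """Montgomery product a*b*2^(-n_bits) mod n, for odd n < 2^n_bits."""
    mask = (1 << n_bits) - 1
    j = (-pow(n, -1, 1 << n_bits)) & mask   # software shortcut, see lead-in
    x = a * b
    y = ((x & mask) * j) & mask
    z = (x + y * n) >> n_bits
    return z - n if z >= n else z

def mont_exp(base, exponent, n, n_bits):
    """Return base**exponent mod n by repeated Montgomery squaring/multiplying."""
    h = pow(2, 2 * n_bits, n)                  # precomputed constant R^2 mod n
    base_m = mont_mul(base % n, h, n, n_bits)  # operand in Montgomery form
    acc = mont_mul(1, h, n, n_bits)            # the value 1 in Montgomery form
    for bit in bin(exponent)[2:]:              # exponent bits, MS bit first
        acc = mont_mul(acc, acc, n, n_bits)    # squaring: equal operands
        if bit == "1":
            acc = mont_mul(acc, base_m, n, n_bits)
    return mont_mul(acc, 1, n, n_bits)         # leave Montgomery form
```

Squaring reuses the multiplication routine with equal multiplicand and multiplier, as the paragraph above describes.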

When describing the workings of a preferred embodiment of the ALU we describe 
synchronization in effective clock cycles, referring to those cycles when the unit is 
performing an arithmetic operation, as opposed to real clock cycles, which would include 
idle cycles during which the ALU may stand while multiplexers, flip-flops, and other device
settings are altered in preparation for a new phase of operations.

In a preferred embodiment, a method for executing a Montgomery modular
multiplication (with reference to squaring and normal multiplication), wherein the
multiplicand A (which may be stored either in the CPU's volatile RAM or in the SA
register, 130), the multiplier B in the B register, 1000, which is a concatenation of 70 and
80, and the modulus N in the N register, 1005, which is a concatenation of 200 and 210,
each comprise m characters of k bits, the multiplicand and the multiplier generally not
being greater than the modulus, comprises the steps of:

1) - loading the multiplier B and the modulus, N, into respective registers of n bit
length, wherein n = m·k;

{when multiplying positive, natural integers in the normal field, N is a second
multiplier}

{if n is longer than the B, N and S registers, values are typically loaded and 
unloaded in and out of these registers during the course of an iteration, 
allowing the machine to be virtually capable of manipulating any length of 
modulus} 

2) - setting the output of the register SB to zero, S*d Flush (250) = 1, for the first iteration;

3) - resetting extraneous borrow and carry flags (controls, not specified in the patent); 

4) - executing m iterations, each iteration comprising the following operations: 

(0 ≤ i ≤ m-1)

a) transferring the next character A_{i-1} of the multiplicand A from volatile storage to
the Ai Load Buffer, 290.

b) simultaneously serially loading the Ci Load Buffer, 320, with N0 (the LS k bits of
N), while rotating the contents of the Ai Load Buffer, thereby serially adding the
contents of the Ai Load Buffer to N0 by means of the serial adder FA1, 330,
thereby serially loading the Ai + Ci Load Buffer with the sum N0 + A_{i-1}.

The preloading phase ends here. This phase is typically executed whilst the MAP is
performing a previous multiplication iteration. Processes a) and b) can be executed
simultaneously, wherein the A_{i-1} character is loaded into its respective register, whilst the
Ai stream is synchronized with the rotation of the N0 register, loading R2, 320.
Simultaneously, the Ai stream and the N0 stream are summated and loaded into the R3
register, 340.

Squaring a quantity from the B register can be executed wherein, at the initialization,
in Steps a) and b), the first k bits of Bd are inserted into R1, as the B0 register is rotated,
simultaneously with the N0 register. Subsequent k bit Bi strings are preloaded into the R1
register, as they are fed serially into the ALU.

c) the machine is stopped. Operands in buffers R1, R2, and R3 are latched into latches
L1, 360; L2, 370; and L3, 380.

The L0 - "0" latch is a pseudo latch, as this is simply a literal command signal
entering each of the AND gates in the inputs or outputs of the multiplexer, 390.

d) for the next k effective clock cycles- 

i) at each effective clock cycle the Y0 SENSE anticipates the next bit of Y0 and
loads this bit through the M3 multiplexer, 300, into the Ci Load Buffer, while
shifting out the Ai bits from the R1 register and simultaneously loading the Ci
Load Buffer with k bits of Y0 and adding the output of R1 with Y0 and loading
this value into the R3 Buffer,

ii) simultaneously multiplying N0 (in L2, Ci Latch) by the incoming Y0 bit, and
multiplying Ai by the next incoming bit of Bd, by means of logically choosing
through the M_K multiplexer, 390, the desired value from one of the four
latches, L0, L1, L2, or L3; thereby adding the two results. If neither the Y0 bit
nor the B bit is one, an all zero value is multiplexed into the CSA; if only the Y0
bit is one, N0 alone is multiplexed/added into the CSA; if only the B bit is a one,
A_{i-1} is added into the CSA; if both the B bit and the Y0 bit are ones, then
A_{i-1} + N0 is added into the CSA,

iii) then adding to this summation, as it serially exits the Carry Save k+1 Bit
Accumulator bit by bit (the X stream), the next relevant bit of Sd, through the
serial adder, FA2, 460.

In MM these first k bits of the Z stream are zero.
In this first phase the result of Y0·N0 + A_{i-1}·B0 + S0 has been computed, the LS k all zero
bits appeared on the Z*out stream, and the MS k+1 bits of the multiplying device are
saved in the CSA Carry Save Accumulator; the R1, R2 and R3 preload buffers hold the
values A_{i-1}, Y0 and Y0 + A_{i-1}, respectively.

e) at the last effective, (m+1)·k'th, clock cycle the machine is stopped, buffers R2 and
R3 are latched into L2 and L3.
The value of L1 is unchanged.

The initial and continuing conditions for the next k·(m-1) effective clock cycles are:
the multipliers are the bit streams from B, starting from the k'th bit of B and the
remaining bit stream from N, also starting from the k'th bit of N;

and the multiplicands in L1, L2, and L3 are A_{i-1}, Y0, and Y0 + A_{i-1}; at the start the
CS adder contains the value as described in d), and the S stream will feed in the next
k(m-1) bits into the FA2 full adder;

during the next km effective clock cycles, Nd, delayed k clock cycles in unit 470, is
subtracted in serial subtractor, 480, from the Z stream, to sense if (Z/2^k mod 2^(k·m)),
the result which is to go into the B or S register, is larger than or equal to N.
Regardless of what is sensed by the serial subtractor, 480, if at the {(m+1)·k}'th
effective clock cycle, the SO1 flip-flop of the CSA is a one, then the total result is
certainly larger than N, and N will be subtracted from the result, as the result, partial
or final, exits its register.

f) for the next k·(m-1) effective clock cycles:

the N0 Register, 210, is rotated either synchronously with incoming Ai bits, or at
another suitable timing, loading R1, R2, and R3, as described in a) and b), for the
next iteration,

for these k(m-1) effective clock cycles, the remaining MS bits of N now multiply
Y0; the remaining MS B bits continue multiplying A_{i-1}. If neither the N bit nor the
B bit is one, an all zero value is multiplexed into the CSA. If only the N bit is one,
Y0 alone is multiplexed/added into the CSA. If only the B bit is a one, A_{i-1} is added
into the CSA. If both the B bit and the N bit are ones, then A_{i-1} + Y0 is added
into the CSA.

Simultaneously the serial output from the CSA is added to the next k·(m-1) S bits
through the FA2 adder, unit 460, which outputs the Z stream,

the relevant part of the Z output stream is the first non-zero k·(m-1) bits of Z.

The Z stream is switched into the SB register, for the first m-1 iterations and into
the SB or B register, as defined for the last iteration;

on the last iteration, the Z stream, disregarding the LS k zero bits, is the
final B* stream. This stream is directed to the B register, to be reduced by N, if
necessary, as it is used in subsequent multiplications and squares;

on the last iteration, Nd, delayed k clock cycles, is subtracted by a serial subtractor
from the Z stream, to sense if the result, which goes into B, is larger than or equal
to N.

At the end of this stage, all the bits from the N, B, and SB registers have been fed into the
ALU, and the final k+1 bits of result are in the CSA, ready to be flushed out.

g) the device is stopped. The S flush, 250; the B flush, 240; and the N flush, 260; are
set to output zero strings, to assure that in the next phase the last k+1 most
significant bits will be flushed out of the CSA. (In a regular multiplication, the M7
MUX, 450, is set to accept the Last Carry from the previous iteration of S.) S has
mk+1 significant bits, but the S register has only mk cells to receive this data. This
last bit is intrinsically saved in the overflow mechanism.

As was explained in e), Nd, delayed k clock cycles in 470, is subtracted from the Z
stream, synchronized with the significant outputs from X, to provide a fine-tune to
the sense mechanism to determine if the result which goes into the B or S register
is larger than or equal to N. 480 and 490 comprise a serial comparator device,
where only the last borrow command bit for modular calculations, and the
(k·m+1)'th bit for regular multiplications in the natural field of integers are saved.

This overflow/borrow command is detected at the m·k'th effective clock cycle.

h) The device is clocked another k cycles, completely flushing out the CSA, while
another k bits are exiting Z to the defined output register.

The instruction to the relevant flip flop commanding serial subtractor 90 or 500 to
execute a subtract of N on the following exit streams is set at the last effective,
(m+1)k'th, clock cycle of the iteration if (Z/2^k - N) > N (Z includes the mk'th MS
bit), as sensed by the following signals:

the SO1 bit, which is the data out bit from the second least significant cell of the CSA,
is a one,

or if the COz bit, which is the internal carry out in the X+S adder, 460, is a one,
or if the borrow bit from the 480 sense subtractor is not set.

This mechanism appears in US Patent 5,513,133 as manufactured both by
Motorola and SGS-Thomson.

For multiplication in the field of natural numbers, it is preferable to detect an
overflow, i.e. whether the m·k'th MS bit is a one. This can happen in the superscalar
multiplier, and cannot happen in the mechanism of US Patent 5,513,133. This overflow can
then be used in the next iteration to insert a MS one in the S (temporary result)
stream.

j) is this the last iteration?
NO, return to c)
YES, continue to k)

k) the correct value of the result can now exit from either the B or S register. 
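
The arithmetic carried out by the steps above can be modelled in Python as follows. This is a behavioural sketch only, under the assumptions that N is odd, N < 2^(k·m) and A, B < N. It ignores the registers, latches and clock phases, the function name `interleaved_montgomery` is hypothetical, and the conditional subtraction is applied after every iteration for simplicity.

```python
def interleaved_montgomery(a, b, n, k, m):
    """Return a*b*2^(-k*m) mod n, taking A in m characters of k bits each."""
    char_mask = (1 << k) - 1
    s = 0                                     # temporary result S
    for i in range(m):
        a_char = (a >> (k * i)) & char_mask   # next k-bit character of A
        t = s + a_char * b
        # Each one bit met here corresponds to a one bit of Y0: N is added
        # at that rank, which clears the bit (N is odd). No J0 is needed.
        for j in range(k):
            if (t >> j) & 1:
                t += n << j
        s = t >> k                            # discard the k all-zero LS bits
        if s >= n:                            # sensed overflow: subtract N
            s -= n
    return s
```

After m iterations the value returned is congruent to A·B·2^(-k·m) mod N and lies below N.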

Y0 bits are anticipated in the following manner in the Y0S - Y0 SENSE unit, 430, from
five deterministic quantities:

i the LS bit of the Ai (L1 Latch) AND the next bit of the Bd stream; A0·Bd;

ii the LS Carry Out bit from the Carry Save Accumulator; CO0;

iii the Sout bit from the second LS cell of the CSA; SO1;

iv the next bit from the S stream; Sd;

v the Carry Out bit from the Full Adder, 460; COz.

These five values are XORed together to produce the next Y0 bit, Y0i:

Y0i = A0·Bd ⊕ CO0 ⊕ SO1 ⊕ Sd ⊕ COz
If the Y 0 i bit is a one, then another N of the same rank (multiplied by the necessary 
power of 2), is typically added, otherwise, N, the modulus, is typically not added. 
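
The effect of this bit-by-bit anticipation can be checked numerically. The sketch below does not model the five gate-level signals; it assumes only that each Y0 bit equals the bit that would otherwise leave the adder at that rank, and shows that the resulting Y0 is the same value that a precomputed J0 would give. The name `sense_y0` is hypothetical.

```python
def sense_y0(t, n, k):
    """Derive the k bits of Y0 serially; return (y0, t + y0*n)."""
    y0 = 0
    for j in range(k):
        y_bit = (t >> j) & 1      # bit about to exit, before N is added
        if y_bit:
            t += n << j           # add N at this rank; bit j becomes zero
            y0 |= 1 << j
    return y0, t
```

For an odd N the returned sum has k zero LS bits, and y0 equals (t·J0) mod 2^k with J0 = -(N^-1) mod 2^k, so the constant never has to be computed beforehand.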

Multiplication of long natural integers in the normal field of numbers. 

This apparatus is suited to efficiently perform multiplications and summations of 
normal integers. If these operands are all of no longer than k bit length, the process
preferably is executed without interleave, where the Z stream of 2k+1 bits is directed
to final storage. For integers longer than k bits, the process is similar to the predescribed
interleaved modular arithmetic process, excepting that the result will now potentially be
one bit longer than twice the length of the longest operand. Further, the apparatus of the
invention is capable, using the resources available in the described device, of
simultaneously performing two separate multiplications: A, a multiplicand, preferably loaded
in segments in the R1 (Ai) register, times B, the multiplier of A, preferably loaded into
the B register as previously designated, plus N, a second multiplier, preferably loaded
into the N register, times an operand, C, loaded into the R2 Register, plus S, a bit stream
entering the apparatus, on the first iteration, only from the Sd signal line, preferably
from the SA register. The Y0 SENSE apparatus is not used. The multiplicands are
summated into the R3 register prior to the initiation of an iteration. At initiation of the
iteration, registers R1, R2, and R3 are copied into latches L1, L2, and L3 until the end of
an iteration. Meanwhile, during the mk + k + 1 effective clock cycles of an iteration, the
next segments of A and C are again preloaded and summated in preparation for the next
iteration.
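
The double multiplication with presummed multiplicands can be sketched as a single accumulation loop. This is an illustrative software model, not the circuit: the tuple `latches` stands in for L0 to L3, the parameter `width` is an assumed common bit length of the two serially fed multipliers, and the function name is hypothetical.

```python
def double_multiply(a, b, c, n, s, width):
    """Return a*b + c*n + s using one accumulator and a four-way select.

    b and n are the serially fed multipliers (width bits each);
    a and c are the multiplicands held in latches.
    """
    latches = (0, a, c, a + c)     # L0 ("0"), L1, L2, L3 (presummed A + C)
    acc = s                        # the incoming S stream
    for j in range(width):
        b_bit = (b >> j) & 1
        n_bit = (n >> j) & 1
        selected = latches[b_bit | (n_bit << 1)]   # choose one of four values
        acc += selected << j       # a single addition per multiplier bit pair
    return acc
```

Presumming A + C is what lets one accumulator serve both products: each step adds exactly one selected value, whatever the combination of the two multiplier bits.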

At each iteration, the first LS k bits of the result on the Z stream, which are now
not by definition zero (as in MM), are directed to a separate storage, vacated to accumulate
the LS portion of the result, again suitably the SA register. The most significant mk + 1
bits comprise the SB temporary quantity for the next iteration. In the last phase, similar
to g, i, and j, the CSA is flushed out of accumulated value. The LS portion, for numbers
which are longer than the multiplier registers, can be exited through the normal data out
register and unloader, units 60 and 30, respectively.

The MS, 2m'th bit of the result is read from the LAST CARRY bit of the FA2, 
unit 460, through the OVERFLOW signal line. 

BRIEF DESCRIPTION OF THE DRAWINGS 
The present invention will be understood and appreciated from the following 
detailed description, taken in conjunction with the drawings in which:

Figs. 1A - 1B, taken together, form a simplified block diagram of a serial-parallel
super scalar arithmetic logic unit (ALU) constructed and operative in accordance with 
one embodiment of the present invention; 

Fig. 2 is a simplified block diagram of a preferred implementation of the serial 
integer division apparatus of Fig. 1A which is also useful separately for serial integer 
division applications, particularly for very large numbers; 

Fig. 3 is a simplified block diagram of a public key crypto-computer for smart 
cards or terminals which includes the serial-parallel arithmetic logic unit of Figs. 1A - 1B;

Fig. 4 is a table showing stages of operation of a division of a dividend by a divisor 
using the division apparatus of Fig. 2, for an example wherein the effective bit-length of 
the divisor is half of the effective bit-length of the dividend; and 

Fig. 5 is a table showing stages of operation of a division of a dividend by a divisor 
using the division apparatus of Fig. 2, for an example wherein the effective bit-length of 
the divisor is less than half of the effective bit-length of the dividend. 

DESCRIPTION OF PREFERRED EMBODIMENTS 
Figs. 1A - 1B, taken together, form a simplified block diagram of a serial-parallel
arithmetic logic unit (ALU) constructed and operative in accordance with a preferred
embodiment of the present invention. The apparatus of Figs. 1A - 1B preferably includes
the following components:

Single Multiplexers - Controlled Switching Elements which select one signal or bit
stream from a multiplicity of input signals and direct this chosen signal to a
single output. Multiplexers are marked M1 to M13, and are intrinsic parts of larger
elements.

M_K Multiplexer, 390, is an array of k+1 single multiplexers, and chooses which of the 
four k or k+1 inputs are to be added into the CSA, 410. 

The B (1000), SA (130), SB (180), and N (1005) are the four main serial registers in
a preferred embodiment. The SA is conceptually and practically redundant, but can
considerably accelerate very long number computations, and save volatile memory
resources, especially in the case where the length of the modulus is 2·k·m bits long,
and also simplify long division computations.

Serial Adders and Serial Subtracters are logic elements that have two serial inputs and
one serial output, and summate or perform subtraction on two long strings of bits. 
Components 90 and 480 are subtracters, 330, and 460 are serial adders. The 
propagation time from input to output is very small. Serial subtracter 90 reduces B* 
to B if B* is larger than or equal to N. Serial Subtracter 480, is used, as part of a 
comparator component to detect if B* will be larger than or equal to N. Full Adder 
330, adds the two bit streams which feed the Load Buffer 340, with a value that is 
equal to the sum of the values in the 290 and 320 Load Buffers. 

Fast Loaders and Unloaders, 10 and 20, and 30 and 40, respectively, are devices to 
accelerate the data flow from the CPU controller. In a preferred embodiment, these can 
comprise DMA or other hardware accelerators. Units 20 and 40 reverse 
the data word, as is necessary for reconciling the division input and output of Fig. 2. 

Data In, 50, is a parallel in serial out device, as the present ALU device is a serial fed 
systolic processor, and data is fed in, in parallel, and processed in serial. 

Data Out, 60, is a serial in parallel out device, for outputting results from the 
coprocessor. The quotient generator is that part of Fig. 2 which generates a quotient 
bit at each iteration of the dividing mechanism. 

Flush Signals on Bd, 240; on S*d, 250; and on Nd, 260, are made to assure that the last 
k+1 bits can flush out the CSA, as the alternative would be a complicated k+1 bit 
parallel output element to retrieve the MS k+1 bits of the accumulator. 

Load Buffers R1, 290; R2, 320; and R3, 340 are serial in parallel out shift registers 
adapted to receive the three possible nonzero multiplicand combinations. 

Latches L1, 360; L2, 370; and L3, 380 receive the outputs from the load 
buffers, thereby giving the load buffers time to process the 
next phase of data before that data is latched into L1, L2, and L3. 

Y0 Sense, 430, is the logic device which determines the number of times the modulus is 
accumulated, in order that a k bit string of LS zeros will exit at Z in Montgomery 
Multiplications and squares. 

One bit delay devices 100, 220 and 230 are inserted in the respective data streams to 
compensate for synchronization problems between the data preparation devices in 
Fig. 1A and the data processing devices in Fig. 1B. 

The k bit delay, shift register, 470, assures that if Z/2^k is larger than or equal to N, the 
comparison of Z/2^k and N will be made with synchronization. 

The Carry Save Accumulator is almost identical to a serial/parallel multiplier, excepting 
for the fact that three different larger than zero values can be summated, instead of 
the single value as usually is latched onto the input of the s/p multiplier. 

The Insert Last Carry, 440, is used to insert the (mk+1)'th bit of the S stream, as the S 
register is only mk bits long. 

The borrow/overflow detect, 490, can either detect if a result is larger than or equal to 
the modulus (from N), or if the mk'th bit is a one. 

The control mechanism is not depicted, but is preferably understood to be a set of 
cascaded counting devices, with switches set for systolic data flow. 

For modular multiplication in the prime and composite prime field of numbers, we 
define A and B to be the multiplicand and the multiplier, and N to be the modulus which 
is usually larger than A or B. N also denotes the register where the value of the modulus 
is stored. N may, in some instances, be smaller than A. We define A, B, and N as 
m·k = n bit long operands. Each k bit group will be called a character, the size of the 
group defined by the size of the multiplying device. Then A, B, and N are each m 
characters long. For ease in following the step by step procedural explanations, assume 
that A, B, and N are 512 bits long, (n = 512); assume that k is 64 bits long because of the 
present cost effective length of such a multiplier, and data manipulation speeds of simple 
CPUs; and m = 8 is the number of characters in an operand and also the number of 
iterations in a squaring or multiplying loop with a 512 bit operand. All operands are 
positive integers. More generally, A, B, N, n, k and m may assume any suitable values. 

In non-modular functions, the N and S registers can be used for temporary storage 
of other arithmetic operands. 

We use the symbol ≡ to denote congruence of modular numbers; for example, 
16 ≡ 2 mod 7, and we say 16 is congruent to 2 modulo 7 as 2 is the remainder when 16 
is divided by 7. When we write Y mod N ≡ X mod N, both Y and X may be larger than 
N; however, for positive X and Y, the remainders will be identical. Note also that the 
congruence of a negative integer Y is Y + u·N, where N is the modulus, and if the 
congruence of Y is to be less than N, u will be the smallest integer which will give a 
positive result. 

We use the symbol, ¥, to denote congruence in a more limited sense. During the 
processes described herein, a value is often either the desired value, or equal to the 
desired value plus the modulus. For example X ¥ 2 mod 7. X can be equal to 2 or 9. 
We say X has limited congruence to 2 mod 7. 

When we write X = A mod N, we define X as the remainder of A divided by N; 
e.g., 3 = 45 mod 7. 

In number theory the modular multiplicative inverse is a basic concept. For 
example, the modular multiplicative inverse of X is written as X^{-1}, which is defined by 
X·X^{-1} mod N = 1. If X = 3 and N = 13, then X^{-1} = 9, i.e., the remainder of 
3·9 divided by 13 is 1. 
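
The inverse in this example can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are ours and are not part of the described apparatus.

```python
# Check of the modular multiplicative inverse example (X = 3, N = 13).
X, N = 3, 13

# Three-argument pow with exponent -1 computes a modular inverse (Python 3.8+).
X_inv = pow(X, -1, N)

# X * X_inv leaves remainder 1 when divided by N: 3 * 9 = 27 = 2 * 13 + 1.
product_remainder = (X * X_inv) % N
print(X_inv, product_remainder)
```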

The acronyms MS and LS are used to signify most significant and least significant 
when referencing bits, characters, and full operand values, as is conventional in digital 
nomenclature. 

Throughout this specification N designates both the value N, and the name of the 
shift register which contains N. An asterisk superscript on a value denotes that the 
value, as it stands, is potentially incomplete or subject to change. A is the value of the 
number which is to be exponentiated, and n is the bit length of the N operand. After 
initialization, when A is "Montgomery normalized" to A* (A* = 2^n·A, to be explained 
later), A* and N are constant values throughout the intermediate steps in the 
exponentiation. During the first iteration, after initialization of an exponentiation, B is 
equal to A*. B is also the name of the register wherein the accumulated value which 
finally equals the desired result of exponentiation resides. S* designates a temporary 
value, and S, S_A and S_B designate, also, the register or registers in which all but the single 
MS bit of S is stored. (S* concatenated with this MS bit is identical to S.) S(i-1) denotes 
the value of S at the outset of the i'th iteration; S_0 denotes the LS character of an S(i)'th 
value. 

We refer to the process (defined later) 𝒫(A·B)N as multiplication in the P field, or 
sometimes, simply, a multiplication operation. 

As we have used the standard structure of a serial/parallel multiplier as the basis for 
constructing a double acting serial/parallel multiplier, we differentiate between the 
summating part of the multiplier, which is based on carry save accumulation (as opposed 
to a carry look ahead adder, or a ripple adder, the first of which is considerably more 
complicated and the second very slow), and call it a carry save adder or accumulator. 
We deal separately with the preloading mechanism and the multiplexer and latches, 
which allow us to simultaneously multiply A times B and C times D and summate both 
results, e.g., A·B + C·D, converting this accumulator into a very powerful engine. 
Additional logic is added to this multiplier in order to provide for an anticipated sense 
operation necessary for modular reduction and serial summation necessary to provide 
powerful modular arithmetic and ordinary integer arithmetic on very large numbers. 

Montgomery Modular Multiplication 

In a classic approach for computing a modular multiplication, A·B mod N, the 
remainder of the product A·B is computed by a division process. A 
conventional division of large operands is more difficult to implement than serial/parallel 
multiplication. 

Using Montgomery's modular reduction method, division is essentially replaced by 
multiplications using two precomputed constants. In the procedure demonstrated herein, 
there is only one precomputed constant, which is a function of the modulus. This 
constant is, or can be, computed using this ALU device. 

A simplified presentation of the Montgomery process, as is used in this device is 
now provided, followed by a complete preferred description. 

If we have an odd number (an LS bit one), e.g., 1010001 (= 81 decimal), we can always 
transform this odd number to an even number (a single LS bit of zero) by adding to it 
another fixing, compensating odd number, e.g., 1111 (= 15 decimal); as 1111 + 1010001 = 
1100000 (96 decimal). In this particular case, we have found a number that produced five LS 
zeros, because we knew in advance the whole string, 81, and could easily determine a 
binary number which we could add to 81, and would produce a new binary number that 
would have as many LS zeros as we might need. This fixing number must be odd, else it has 
no effect on the progressive LS bits of a result. 
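
As a minimal sketch of this zero-forcing idea, the fixing number for a known value can be computed directly. The helper name fixing_number is hypothetical and is used here only for illustration.

```python
def fixing_number(value: int, zeros: int) -> int:
    """Smallest non-negative number which, added to value, clears its `zeros` LS bits."""
    return (-value) % (1 << zeros)

value = 0b1010001               # 81 decimal, an odd number
fix = fixing_number(value, 5)   # the compensating number for five LS zeros
total = value + fix

# Prints the fixing number and the sum in binary.
print(bin(fix), bin(total))
```

For an odd value the fixing number is itself odd, which agrees with the observation that an even fix could not change the LS bit.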

If our process is a clocked serial/parallel carry save process, where it is desired to 
have a continuous number of LS zeros, and wherein at each clock cycle we only have to 
fix the next bit, at each clock it is sufficient to add the fix if the next bit would 
potentially be a one, or not to add the fix if the potential bit were to be a zero. However, 
in order not to cause interbit overflows (double carries), this fix is preferably summated 
previously with the multiplicand, to be added into the accumulator when the relevant 
multiplier bit is one, and the Y0 Sense also detects a one. 

Now, as in modular arithmetic we are only interested in the remainder of a value 
divided by the modulus, we know that we can add the modulus any number of times to a 
value, and still have a value that would have the same remainder. This means that we can 
add Y·N = Σ y_i·2^i·N to any integer, and still have the same remainder; Y being the number 
of times we add in the modulus, N, to produce the required LS zeros. As described, the 
modulus that we add can only be odd. Methods exist wherein even moduli are defined as 
2^i times the odd number that results when i is the number of LS zeros in the even 
number. 

The Montgomery interleaved variations are aimed at working within 
the limited storage space we have for numbers and the cost effective size of the 
multipliers. This is especially useful when performing public key cryptographic functions 
where we are constantly multiplying one large integer, e.g., n = 1024 bits, by another large 
integer; a process that would ordinarily produce a double length 2048 bit integer. 

We can add in Ns (the modulus) enough times to A·B = X or A·B+S = X during the 
process of multiplication (or squaring) so that we will have a number, Z, that has n LS 
zeros and, at most, n+1 MS bits. 

We can continue using such numbers, disregarding the LS n bits, if we remember 
that by disregarding these zeros, we have divided the desired result by 2^n. 

When the LS n bits are disregarded, and we only use the most significant n (or 
n+1) bits, then we have effectively multiplied the result by 2^{-n}, the modular inverse of 2^n. 
If we would subsequently re-multiply this result by 2^n mod N (or 2^n) we would obtain a 
value congruent to the desired result (having the same remainder) as A·B+S mod N. As 
is seen, using MM, the result is preferably multiplied by 2^{2n} to overcome the 2^{-n} parasitic 
factor reintroduced by the MM. 

Example: 

A·B+S mod N = (12·11+10) mod 13 = (1100·1011+1010)_2 mod 1101_2. 

We will add in 2^i·N whenever a fix is necessary on one of the n LS bits. 

      B                  1011 
  x   A                  1100 
  add S                  1010 
  add A(0)·B             0000 
      sum LS bit = 0, don't add N 
  add 2^0·(N·0)          0000 
      sum                0101  -> 0  LS bit leaves carry save adder 

  add A(1)·B             0000 
      sum LS bit = 1, add N 
  add 2^1·(N·1)          1101 
      sum                1001  -> 0  LS bit leaves CS adder 

  add A(2)·B             1011 
      sum LS bit = 0, don't add N 
  add 2^2·(N·0)          0000 
      sum                1010  -> 0  LS bit leaves CS adder 

  add A(3)·B             1011 
      sum LS bit = 1, add N 
  add 2^3·(N·1)          1101 
      sum               10001  -> 0  LS bit leaves CS adder 

And the result is 10001 0000_2 mod 13 = 17·2^4 mod 13. 

As 17 is larger than 13 we subtract 13, and the result is: 
17·2^4 ≡ 4·2^4 mod 13. 

Formally, 2^{-n}·(A·B+S) mod N = 9·(12·11+10) mod 13 = 4. 

In Montgomery arithmetic we utilize only the MS non-zero result (4) and 
effectively remember that the real result has been divided by 2^n; n zeros having been 
forced onto the MM result. 

We have added in (8+2)·13 = 10·13, which effectively multiplied the result by 
2^4 mod 13 ≡ 3. In effect, had we used the superfluous zeros, we can say that we have 
performed A·B+Y·N+S = (12·11+10·13+10) in one process, which will be described 
as possible on a preferred embodiment. 

Check: (12·11+10) mod 13 = 12; 4·3 = 12. 

In summary, the result of a Montgomery Multiplication is the desired result 
multiplied by 2^{-n}. 
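
The bit-by-bit procedure of the worked example (A = 12, B = 11, S = 10, N = 13, n = 4) can be modelled in software. The following Python sketch is an arithmetic model only, under our own naming (serial_montgomery, acc, y); it does not represent the carry save hardware, and it assumes B and S are smaller than N so that one final subtraction suffices.

```python
def serial_montgomery(a: int, b: int, s: int, n_mod: int, n_bits: int):
    """Return (result, y): result = (a*b + s + y*n_mod) / 2**n_bits, reduced once."""
    if n_mod % 2 == 0:
        raise ValueError("the modulus must be odd")
    acc, y = s, 0
    for i in range(n_bits):
        if (a >> i) & 1:        # multiplier bit A(i) is one: add the multiplicand
            acc += b
        if acc & 1:             # the next output bit would be a one: add N
            acc += n_mod
            y |= 1 << i         # record that 2^i * N was added
        acc >>= 1               # a zero LS bit leaves the accumulator
    if acc >= n_mod:            # at most one subtraction under the stated assumption
        acc -= n_mod
    return acc, y

result, y = serial_montgomery(12, 11, 10, 13, 4)
print(result, y)
```

Here y collects the powers of two at which the modulus was added, and multiplying the result by 2^4 mod 13 recovers (12·11+10) mod 13.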

To retrieve the previous result back into a desired result using the same 
multiplication method, we would have to Montgomery Multiply the previous result by 
2^{2n}, which we will call H, as each MM leaves us with a parasitic factor of 2^{-n}. 

The Montgomery Multiply function 𝒫(A·B)N performs a multiplication modulo N 
of the A·B product into the P field. (In the above example, where we derived 4). The 
retrieval from the P field back into the normal modular field is performed by enacting 𝒫 
on the result of 𝒫(A·B)N using the precomputed constant H. Now, if P ≡ 𝒫(A·B)N, it 
follows that 𝒫(P·H)N ≡ A·B mod N; thereby performing a normal modular multiplication 
in two P field multiplications. 

Montgomery modular reduction averts a series of multiplication and 
division operations on operands that are n and 2n bits long, by performing a series of 
multiplications, additions, and subtractions on operands that are n or n+1 bits long. The 
entire process yields a result which is smaller than or equal to N. For given A, B and odd 
N there is always a Q, such that A·B + Q·N will result in a number whose n LS bits are 
zero, or: 

P·2^n = A·B + Q·N 

This means that we have an expression 2n bits long, whose n LS bits are 
zero. 

Now, let I·2^n ≡ 1 mod N (I exists for all odd N). Multiplying both sides of the 
previous equation by I yields the following congruences; 
from the left side of the equation: 

P·I·2^n ≡ P mod N; (Remember that I·2^n ≡ 1 mod N) 

and from the right side: 

A·B·I + Q·N·I ≡ A·B·I mod N; (Remember that Q·N·I ≡ 0 mod N) 

therefore: 

P ≡ A·B·I mod N. 

This also means that a parasitic factor I = 2^{-n} mod N is introduced each time a P field 
multiplication is performed. 

We define the 𝒫 operator such that: 

P ≡ A·B·I mod N ≡ 𝒫(A·B)N. 

and we call this "multiplication of A times B in the P field", or Montgomery 
Multiplication. 

The retrieval from the P field can be computed by operating 𝒫 on P·H, making: 

𝒫(P·H)N ≡ A·B mod N; 

We can derive the value of H by substituting P in the previous congruence. We find: 

𝒫(P·H)N ≡ (A·B·I)(H)(I) mod N; 

(note that A·B·I ← P; H ← H; and any multiplication 
operation introduces a parasitic I) 

If H is congruent to the multiplicative inverse of I^2 then the congruence is 
valid, therefore: 

H = I^{-2} mod N ≡ 2^{2n} mod N 

(H is a function of N and we call it the H parameter) 
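
The role of H can be checked numerically with a reference model of the operator. In the sketch below the function mont is our own shorthand for A·B·I mod N, computed with ordinary big-integer arithmetic rather than by the device; the operand values are borrowed from the hexadecimal example given later in this specification.

```python
def mont(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Reference model of the P field operator: a * b * 2^(-n_bits) mod n_mod."""
    i_factor = pow(2, -n_bits, n_mod)   # the parasitic factor I (Python 3.8+)
    return (a * b * i_factor) % n_mod

N, n = 0xA59, 12
A, B = 0x99B, 0x5C3

H = pow(2, 2 * n, N)      # H = I^(-2) mod N = 2^(2n) mod N
P = mont(A, B, N, n)      # first P field multiplication: A*B*I mod N
F = mont(P, H, N, n)      # second one removes the parasitic factor

print(hex(H), hex(P), hex(F))
```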

In conventional Montgomery methods, to enact the 𝒫 operator on A·B, 
the following process may be employed, using the precomputed constant J: 

1) X = A·B 

2) Y = (X·J) mod 2^n (only the n LS bits are necessary) 

3) Z = X + Y·N 

4) S = Z / 2^n (The requirement on J is that it forces Z to be 
divisible by 2^n) 

5) P ¥ S mod N (N is to be subtracted from S, if S > N) 

Finally, at step 5): 

P ¥ 𝒫(A·B)N, 

[After the subtraction of N, if necessary: 
P = 𝒫(A·B)N.] 

Following the above: 

Y = A·B·J mod 2^n (using only the n LS bits); 

and: 

Z = A·B + (A·B·J mod 2^n)·N. 

In order that Z be divisible by 2^n (the n LS bits of Z are preferably zero), 
the following congruence must exist: 

[A·B + (A·B·J mod 2^n)·N] mod 2^n ≡ 0 

In order that this congruence will exist, N·J mod 2^n must be congruent to -1, 
or: 

J ≡ -N^{-1} mod 2^n, 

and we have found the constant J. 

J, therefore, is a precomputed constant which is a function of N only. 
However, in a machine that outputs a MM result, bit by bit, provision should be made to 
add in Ns at each instance where the output bit in the LS string would otherwise have 
been a one, thereby obviating the necessity of precomputing J and subsequently 
computing Y = A·B·J mod 2^n, as Y can be detected bit by bit using hardwired logic. We 
have also described that this method can only work for odd Ns. 
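
For comparison, the conventional five step process with a precomputed J can be sketched as follows. This is an illustrative software model under our own naming (montgomery_with_j), not the hardwired method of the present device; it subtracts N when S is larger than or equal to N, and the test operands are taken from the hexadecimal example later in this specification.

```python
def montgomery_with_j(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Conventional Montgomery multiplication using the precomputed constant J."""
    r = 1 << n_bits                      # 2^n
    j = (-pow(n_mod, -1, r)) % r         # J = -N^(-1) mod 2^n, needs odd N
    x = a * b                            # step 1
    y = (x * j) % r                      # step 2: only the n LS bits are kept
    z = x + y * n_mod                    # step 3
    if z % r != 0:
        raise ArithmeticError("J failed to force n LS zeros")
    s = z >> n_bits                      # step 4: exact division by 2^n
    return s - n_mod if s >= n_mod else s   # step 5: at most one subtraction

p_conventional = montgomery_with_j(0x99B, 0x5C3, 0xA59, 12)
print(hex(p_conventional))
```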

Therefore, as is apparent, the process described employs three multiplications, one 
summation, and a maximum of one subtraction, for the given A, B, N, and a 
precomputed constant to obtain 𝒫(A·B)N. Using this result, the same process and a 
precomputed constant, H (a function of the modulus N), we are able to find A·B mod N. 
As A can also be equal to B, this basic operator can be used as a device to square or 
multiply in the modular arithmetic. 

Interleaved Montgomery Modular Multiplication 

The previous section describes a method for modular multiplication which involved 
multiplications of operands which were all n bits long, and results which required 
2n + 1 bits of storage space. 

Using Montgomery's interleaved reduction (as described in the aforementioned 
paper by Dusse), it is possible to perform the multiplication operations with shorter 
operands, registers, and hardware multipliers; enabling the implementation of an 
electronic device with relatively few logic gates. 

First we will describe how the device can work, if at each iteration of the 
interleave, we compute the number of times that N is added, using the J_0 constant. Later, 
we describe how to interleave, using a hardwire derivation of Y_0, which will eliminate the 
J_0 phase of each multiplication (step 2 in the following example), and enable us to integrate 
the functions of two separate serial/parallel multipliers into the new single generic multiplier 
which can perform A·B+C·N+S at better than double speed using similar silicon 
resources. 

Using a k bit multiplier, it is convenient to define characters of k bit length; there 
are m characters in n bits; i.e., m·k = n. 
J_0 will be the LS character of J. 
Therefore: 

J_0 ≡ -N_0^{-1} mod 2^k (J_0 exists as N is odd). 

Note, the J and J_0 constants are compensating numbers that when enacted 
on the potential output, tell us how many times to add the modulus, in order to have a 
predefined number of least significant zeros. We will later describe an additional 
advantage to the present serial device; since, as the next serial bit of output can be easily 
determined, we can always add the modulus (always odd) to the next intermediate result. 
This is the case if, without this addition, the output bit, the LS serial bit exiting the CSA, 
would have been a "1"; thereby adding in the modulus to the previous even intermediate 
result, and thereby promising another LS zero in the output string. Remember, 
congruency is maintained, as no matter how many times the modulus is added to the 
result, the remainder is constant. 

In the conventional use of Montgomery's interleaved reduction, 𝒫(A·B)N is enacted 
in m iterations as described in steps (1) to (5): 

Initially S(0) = 0 (the ¥ value of S at the outset of the first iteration). 
For i = 1, 2, ..., m: 

1) X = S(i-1) + A_{i-1}·B (A_{i-1} is the (i-1)'th character of A; S(i-1) is the value 
of S at the outset of the i'th iteration.) 

2) Y_0 = X_0·J_0 mod 2^k (The LS k bits of the product X_0·J_0) 
(The process uses and computes the k LS bits only, 
e.g., the least significant 64 bits) 
In the preferred implementation, this step is obviated, because in a serial 
machine Y_0 can be anticipated bit by bit. 

3) Z = X + Y_0·N 

4) S(i) = Z/2^k (The k LS bits of Z are always 0, therefore Z is always divisible 
by 2^k. This division is tantamount to a k bit right shift as the LS k bits of Z are all zeros; 
or, as will be seen in the circuit, the LS k bits of Z are simply disregarded.) 

5) S(i) = S(i) mod N (N is to be subtracted from those S(i)'s which are 
larger than N). 

Finally, at the last iteration (after the subtraction of N, when necessary), 
C = S(m) = 𝒫(A·B)N. To derive F = A·B mod N, the P field computation, 𝒫(C·H)N, 
is performed. 
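
Steps (1) to (5) can be sketched in Python as below. This is a software model under stated assumptions, not the circuit: the names interleaved_montgomery and trace are ours, J_0 is computed with the built-in pow rather than anticipated by hardwired logic, and N is subtracted only once after the last iteration, which is sufficient when B is smaller than N because each S(i) then stays below 2N. The operands are those of the hexadecimal example in this specification.

```python
def interleaved_montgomery(a: int, b: int, n_mod: int, k: int, m: int):
    """Return (C, trace) where C = a*b*2^(-k*m) mod n_mod, for b < n_mod, n_mod odd."""
    mask = (1 << k) - 1
    j0 = (-pow(n_mod & mask, -1, 1 << k)) & mask   # J_0 = -N_0^(-1) mod 2^k
    s, trace = 0, []
    for i in range(m):
        a_i = (a >> (k * i)) & mask      # i'th character of A, LS character first
        x = s + a_i * b                  # step 1
        y0 = ((x & mask) * j0) & mask    # step 2: k LS bits only
        z = x + y0 * n_mod               # step 3: k LS bits of Z are now zero
        s = z >> k                       # step 4
        trace.append((x, y0, z, s))
    if s >= n_mod:                       # step 5, applied once at the end
        s -= n_mod
    return s, trace

N, k, m = 0xA59, 4, 3
C, trace = interleaved_montgomery(0x99B, 0x5C3, N, k, m)   # P field product
F, _ = interleaved_montgomery(C, 0x44B, N, k, m)           # retrieval with H = 44b
print(hex(C), hex(F))
```

Each entry of trace holds the X, Y_0, Z and S(i) values of one iteration, so the intermediate values can be compared against a hand computation.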

It is desired to know, in a preferred embodiment, that for all S(i)'s, S(i) is smaller 
than 2N. This also means, that the last result (S(m)) can always be reduced to a quantity 
less than N with, at most, one subtraction of N. 

We observe that for operands which are used in the process: 

S(i-1) < 2^{n+1} (the temporary register can be one bit longer than the B or N 
register), 

B < N < 2^n and A_{i-1} < 2^k. 

By definition: 

S(i) = Z/2^k (The value of S at the end of the process, before a 
possible subtraction) 

For all Z, Z(i) < 2^{n+k+1}. 

X_max = S_max + A_i·B < 2^{n+1} - 1 + (2^k - 1)(2^n - 1) 
Q_max = Y_0·N < (2^k - 1)(2^n - 1) 

therefore: 

Z_max < 2^{k+n+1} - 2^{k+1} + 1 < 2^{k+n+1} - 1, 

and as Z_max is divided by 2^k: 

S(m) < 2^{n+1} - 2^1. 

Because N_min > 2^n - 2, S(m)_max is always less than 2·N_min, and therefore, one subtraction is 
all that is necessary on a final result. 

S(m)_max - N_min = (2^{n+1} - 2^1 - 1) - (2^n - 1) = 2^n - 2 < N_min. 

Example of a Montgomery interleaved modular multiplication: 

The following computations in the hexadecimal format clarify the meaning of the 
interleaved method: 

N = a59 (the modulus), A = 99b (the multiplier), B = 5c3 (the multiplicand), n = 12 (the 
bit length of N), k = 4 (the size in bits of the multiplier and also the size of a character), 
and m = 3, as n = k·m. 

J_0 = 7 as 7·9 ≡ -1 mod 16, and H ≡ 2^{2·12} mod a59 ≡ 44b. 

The expected result is F = A·B mod N = 99b·5c3 mod a59 ≡ 375811 mod a59 = 220_16. 
Initially: S(0) = 0 

Step 1  X = S(0) + A_0·B = 0 + b·5c3 = 3f61 
        Y_0 = X_0·J_0 mod 2^k = 7 (Y_0 - hardwire anticipated in new MAP) 

WO 98/50851    PCT/IL98/00148 

        Z = X + Y_0·N = 3f61 + 7·a59 = 87d0 
        S(1) = Z/2^k = 87d 

Step 2  X = S(1) + A_1·B = 87d + 9·5c3 = 3c58 
        Y_0 = X_0·J_0 mod 2^k = 8·7 mod 2^4 = 8 (Hardwire anticipated) 
        Z = X + Y_0·N = 3c58 + 52c8 = 8f20 
        S(2) = Z/2^k = 8f2 

Step 3  X = S(2) + A_2·B = 8f2 + 9·5c3 = 3ccd 
        Y_0 = d·7 mod 2^4 = b (Hardwire anticipated) 
        Z = X + Y_0·N = 3ccd + b·a59 = aea0 
        S(3) = Z/2^k = aea, 

as S(3) > N, 

S(m) = S(3) - N = aea - a59 = 91 
Therefore C = 𝒫(A·B)N = 91_16. 

Retrieval from the P field is performed by computing 𝒫(C·H)N: 
Again initially: S(0) = 0 

Step 1  X = S(0) + C_0·H = 0 + 1·44b = 44b 
        Y_0 = d (Hardwire anticipated in new MAP) 
        Z = X + Y_0·N = 44b + 8685 = 8ad0 
        S(1) = Z/2^k = 8ad 

Step 2  X = S(1) + C_1·H = 8ad + 9·44b = 2f50 
        Y_0 = 0 (Hardwire anticipated in new MAP) 
        Z = X + Y_0·N = 2f50 + 0 = 2f50 
        S(2) = Z/2^k = 2f5 

Step 3  X = S(2) + C_2·H = 2f5 + 0·44b = 2f5 
        Y_0 = 3 (Hardwire anticipated in new MAP) 
        Z = X + Y_0·N = 2f5 + 3·a59 = 2200 
        S(3) = Z/2^k = 220_16 

which is the expected value of 99b·5c3 mod a59. 

If at each step we disregard k LS zeros, we are in essence multiplying the n MS bits by 
2^k. Likewise, at each step, the i'th segment of the multiplier is also a number multiplied 
by 2^{ik}, giving it the same rank as S(i). 

It can also be noted that in another preferred embodiment, wherein it is of 
some potential value to know the J_0 constant, 

Exponentiation 

The following derivation of a sequence [D. Knuth, The art of computer 
programming, vol. 2: Seminumerical algorithms, Addison-Wesley, Reading Mass., 1981] 
hereinafter referred to as "Knuth", explains a sequence of squares and multiplies, which 
implements a modular exponentiation. 

After precomputing the Montgomery constant, H = 2^{2n} mod N, as this device can both 
square and multiply in the P field, we compute: 

C = A^E mod N. 

Let E(j) denote the j'th bit in the binary representation of the exponent E, starting with the 
MS bit whose index is 1 and concluding with the LS bit whose index is q. We can 
exponentiate as follows for odd exponents: 

A* ¥ 𝒫(A·H)N          A* is now equal to A·2^n. 

FOR j = 2 TO q-1 
    B ¥ 𝒫(B·B)N 
    IF E(j) = 1 THEN 
        B ¥ 𝒫(B·A*)N 
ENDFOR 

B ¥ 𝒫(B·A)N           E(0)=1; B is the last desired temporary result 
                      multiplied by 2^n, A is the original A. 

C = B 

C = C-N if C>N. 

After the last iteration, the value B is ¥ to A^E mod N, and C is the final value. 

To clarify, we shall use the following example: 

E = 1011_2: E(1) = 1; E(2) = 0; E(3) = 1; E(4) = 1; 

To find A^{1011_2} mod N; q = 4 

A* = 𝒫(A·H)N = A·I^{-2}·I = A·I^{-1} mod N 
B = A* 

FOR j = 2 to q 

j = 2   B = 𝒫(B·B)N which produces: A^2·(I^{-1})^2·I = A^2·I^{-1} 
        E(2) = 0; B = A^2·I^{-1} 

j = 3   B = 𝒫(B·B)N = (A^2)^2·(I^{-1})^2·I = A^4·I^{-1} 
        E(3) = 1; B = 𝒫(B·A*)N = (A^4·I^{-1})(A·I^{-1})·I = A^5·I^{-1} 

j = 4   B = 𝒫(B·B)N = A^{10}·(I^{-1})^2·I = A^{10}·I^{-1} 

As E(4) was odd, the last multiplication will be by A, to remove the parasitic I^{-1}. 

B = 𝒫(B·A)N = A^{10}·I^{-1}·A·I = A^{11} 
C = B 
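
The square and multiply sequence of this worked example can be modelled in Python as follows. The sketch follows the example's loop, squaring for every bit from E(2) to E(q) and using the original A for the final multiplication; the names mont and mont_exp are ours, mont is a fully reduced reference model of the operator rather than the device, and the exponent is assumed odd and at least 3.

```python
def mont(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Reference model of the P field operator: a * b * 2^(-n_bits) mod n_mod."""
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod

def mont_exp(a: int, e: int, n_mod: int, n_bits: int) -> int:
    """Compute a**e mod n_mod with P field squares and multiplies (odd e >= 3)."""
    if e < 3 or e % 2 == 0:
        raise ValueError("this sketch assumes an odd exponent of at least 3")
    h = pow(2, 2 * n_bits, n_mod)          # the H parameter
    a_star = mont(a, h, n_mod, n_bits)     # A* = A * 2^n mod N
    b = a_star
    bits = bin(e)[2:]                      # E(1) ... E(q), MS bit first
    q = len(bits)
    for j in range(2, q + 1):
        b = mont(b, b, n_mod, n_bits)      # square for every bit after the MS bit
        if j < q and bits[j - 1] == "1":
            b = mont(b, a_star, n_mod, n_bits)
    # The LS bit of an odd exponent is one: multiplying by the original A
    # removes the parasitic factor carried by B.
    return mont(b, a, n_mod, n_bits)

c_example = mont_exp(0x99B, 0b1011, 0xA59, 12)
print(hex(c_example))
```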

A method for computing the H parameter by a reciprocal process is described in US 
Patent 5,513,133. 

A Long Division Process 

Fig. 2 illustrates a preferred embodiment of a deterministic processing device for 
performing long division, using the data manipulation devices available on the processor 
of Fig. 1A. 

Figs. 4 and 5 are examples of the operation of the apparatus of Fig. 2. 

In a division process wherein the divisor, d, is in the range 2^{n-1} < d < 2^n, and 
the dividend, D, is in the range 2^{2n-1} < D < 2^{2n}, the apparatus is used most simply: d 
is preloaded into the N right-shift register, 1005 (a concatenation of the N_0 and the N_1 
registers, 210 and 200) in Fig. 2; the MS n bits of the dividend, D, are preloaded in the B 
right-shift register (1000); and the n LS bits of D are reverse loaded into the S_A right-shift 
register (130). This is essentially how one would arrange digits for manual long 
division, realizing that the new LS bits are fed from S_A to S_B at each new trial subtraction. 
The S_B register is preloaded with all zeros. 

In the initialization iteration, the overflow flip flop 170 of Fig. 2 is initially reset. 
The N register, which now contains the divisor, d, is trial subtracted from B, the MS bits 
of D, whilst both the B and N registers are rotated; their outputs are fed into the 
serial subtracter 90, whose output is B - f·N, with f = 1 for subtract and f = 0 for don't subtract. The 
quotient generator, 120, is a detector which determines if B ≥ N, and transmits a NEXT 
SUBTRACT signal which determines if d will be subtracted from B, in 90, in the next iteration; 
this signal is a one if, and only if, B is larger than or equal to N. This NEXT 
SUBTRACT bit denotes success, is the most significant bit of the quotient, and is 
also clocked into the S_B register, 180. As this bit is clocked into the S_B register, a zero is 
shifted out of the S_B register into the S_A register, forcing a NEW LS BIT out of the S_A 
register. Note that, as these registers are all right-shift registers, both the quotient in S_B 
and the dividend value in S_A are held in reverse order. 

In the next iterations, at the first effective clock cycle, a NEW LS BIT, shifted out 
of the S_A register, precedes the output, R, of 90, thereby multiplying R by 2 and adding the value 
of the NEW LS BIT to 2R. This concatenated value is rotated back into the B register, 
and is also tested in 120, to determine if on the next round f = 1 or f = 0. 

Finally, the remainder is in the B register, n bits of the quotient are in the S_B 
register and the most significant bits of the quotient are in the S_A register. 

The most significant one of both the dividend and the divisor is preferably in the 
most significant bit cells of the B register and the N register, for all sizes of D and d. The 
number of iterations necessary to obtain a result is decreased for Ds which are smaller 
than 2^{2n-1}, and is increased for ds which are smaller than 2^{n-1}. The device is hardware 
driven, and firmware compensations may be provided for shifting operands when 
unloading the device. 

The program residing in the non-volatile memory of Fig. 3 preferably ascertains 
that the registers are loaded, and defines for the control register a number of iterations 
necessary for a successful division process. The quotient bits are re-reversed, byte by 
byte, when processed through the reverse data out unloader, units 60 and 40. 

This processor is an element useful for computing the H Parameter, and also is 
preferably used in computations of the Euclidean functions. 

The serial integer division apparatus of Fig. 1A and the double acting multiplier of Fig. 
1B, which performs A×B + C×D + S, typically do not operate simultaneously. 

In the example of Fig. 4, the dividend B is 187 (10111011), the divisor N is 7 
(111), and once division is carried out, the quotient is found to be 26 (11010) and the 
remainder is found to be 5 (101). 

In the example of Fig. 5, the dividend B is 173 (10101101), the divisor N is 5 
(101), and once division is carried out, the quotient is found to be 34 (100010) and the 
remainder is found to be 3 (11). 
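
The trial subtraction principle of the division process can be sketched in Python as a plain bit-serial (restoring) long division. This is an assumption-laden arithmetic model under our own naming (serial_divide); it does not model the register preloading, the reverse loading, or the NEXT SUBTRACT timing of Fig. 2.

```python
def serial_divide(dividend: int, divisor: int):
    """Bit-serial long division by trial subtraction; returns (quotient, remainder)."""
    if divisor <= 0 or dividend < 0:
        raise ValueError("expects a non-negative dividend and a positive divisor")
    remainder, quotient = 0, 0
    for i in reversed(range(dividend.bit_length())):
        # Bring down the next dividend bit: 2R + NEW LS BIT.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        subtract = remainder >= divisor      # f = 1 when the trial subtraction succeeds
        if subtract:
            remainder -= divisor
        quotient = (quotient << 1) | int(subtract)   # quotient bits arrive MS first
    return quotient, remainder

# Operands of the Fig. 5 example: dividend 173, divisor 5.
q5, r5 = serial_divide(173, 5)
print(bin(q5), bin(r5))
```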

A carry-save accumulator is illustrated in Fig. 5 of U.S. Patent 5,513,133 to 
Gressel. 

A serial full adder is illustrated in Fig. 7 of the above-referenced U.S. Patent to 
Gressel. 

A serial subtracter is illustrated in Fig. 8 of the above-referenced U.S. Patent to 
Gressel. 

Division apparatus is illustrated in Fig. 9 of the above-referenced U.S. Patent to 
Gressel. 
The term "normal field of integers" refers to non-negative integers e.g. natural 
numbers. 

According to a preferred embodiment of the present invention, the system shown 
and described herein is operative to compute J_0 by resetting a and b to zero and setting 
s_0 = 1. 
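
One way to read this statement is that, with a and b at zero and a starting value of one, the bit-by-bit zero forcing adds the modulus exactly J_0 times over k clock cycles. The Python sketch below illustrates that interpretation only; the name j0_by_sensing is hypothetical and the model is arithmetic, not a description of the actual circuit.

```python
def j0_by_sensing(n_mod: int, k: int) -> int:
    """Derive J_0 = -N_0^(-1) mod 2^k by zero forcing with a = b = 0 and s_0 = 1."""
    if n_mod % 2 == 0:
        raise ValueError("the modulus must be odd")
    acc, y0 = 1, 0                # accumulator starts at s_0 = 1
    for i in range(k):
        if acc & 1:               # sensed output bit is one: add the modulus
            acc += n_mod
            y0 |= 1 << i          # record bit i of Y_0
        acc >>= 1                 # a zero LS bit leaves the accumulator
    return y0                     # 1 + Y_0 * N is divisible by 2^k, so Y_0 = J_0

j0 = j0_by_sensing(0xA59, 4)
print(j0)
```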

A portion of the disclosure of this patent document contains material which is 
subject to copyright protection. The copyright owner has no objection to the facsimile 
reproduction by anyone of the patent document or the patent disclosure, as it appears in 
the Patent and Trademark Office patent file or records, but otherwise reserves all 
copyright rights whatsoever. 

It is appreciated that the software components of the present invention may, if 
desired, be implemented in ROM (read-only memory) form. The software components 
may, generally, be implemented in hardware, if desired, using conventional techniques. 

It is appreciated that various features of the invention which are, for clarity, 
described in the contexts of separate embodiments may also be provided in combination 
in a single embodiment. Conversely, various features of the invention which are, for 
brevity, described in the context of a single embodiment may also be provided separately 
or in any suitable subcombination. 

It will be appreciated by persons skilled in the art that the present invention is not 
limited to what has been particularly shown and described hereinabove. Rather, the scope 
of the present invention is defined only by the claims that follow: 

CLAIMS 

1. A modular multiplication and exponentiation system comprising: 

a serial-parallel arithmetic logic unit (ALU) including a single modular multiplying 
device having a single carry-save adder. 

2. A system according to claim 1 which is operative to multiply at least one pair of 
integer inputs of any bit length. 

3. A system according to claim 2 wherein said at least one pair of integer inputs 
comprises two pairs of integer inputs. 

4. A system according to any of the preceding claims wherein said ALU is operative 
to generate a product of integer inputs and to reduce the size of said product without 
previously computing a zero-forcing Montgomery constant J0. 

5. A serial integer division system comprising: 

a serial division device operative to receive a dividend of any bit length and a 
divisor of any bit length and to compute a quotient and a remainder. 

6. A system according to any of claims 1 - 3 and also comprising a pair of registers 
for storing a pair of integer inputs, said system being operative to multiply a respective 
pair of integer inputs, at least one of which exceeds the bit length of its respective 
register, without interleaving. 

7. A system according to any of claims 1 - 3 and also comprising a serial division 
device operative to receive a dividend of any bit length and a divisor of any bit length and 
to compute a quotient and a remainder. 

8. A modular multiplication and exponentiation system comprising: 
a serial-parallel multiplying device having only one carry-save accumulator and 
being operative to perform a pair of multiplications and to sum results thereof. 

9. A modular multiplication and exponentiation method comprising: 

providing a serial-parallel arithmetic logic unit (ALU) including a single modular 
multiplying device having a single carry-save adder; and 

employing said serial-parallel ALU to perform modular multiplication and 
exponentiation. 

10. A method for natural (not modular) multiplication of large integers, the method 
comprising: 

providing a serial-parallel arithmetic logic unit (ALU) including a single modular 
multiplying device having a single carry-save adder; and 

employing said serial-parallel ALU to perform natural (not modular) multiplication 
of large integers. 

11. A method according to claim 10 wherein said employing step comprises 
multiplying a first integer of any bit length by a second integer of any bit length to obtain 
a first product, multiplying a third integer of any bit length by a fourth integer of any bit 
length to obtain a second product, and summing said first and second products with a 
fifth integer of any bit length to obtain a sum. 

12. A method according to claim 9 wherein said employing step comprises 
performing modular multiplication and exponentiation with a multiplicand, multiplier and 
modulus of any bit length. 

13. A system according to claim 8 and also comprising a double multiplicand 
precomputing system for executing Montgomery modular multiplication with only one 
precomputed constant. 

14. A method according to claim 9 wherein said employing step comprises 
performing Montgomery multiplication including: 

generating a product of integer inputs including a multiplier and a multiplicand; and 
executing modular reduction without previously computing a Montgomery 
constant J0. 

15. A method according to claim 14 wherein said Montgomery constant J0 
comprises a function of N mod 2^k, where N is a modulus of said modular reduction and 
k is the bit-length of the multiplicand. 

16. A method according to claim 9 wherein said employing step comprises 
performing a sequence of interleaved Montgomery multiplication operations. 

17. A method according to claim 16 wherein each of said interleaved Montgomery 
multiplication operations is performed without previously computing the number of times 
the modulus must be summated into a congruence of the multiplication operation in 
order to force a result with at least k significant zeros. 

18. A system according to claim 8 and also comprising a data preprocessor 
operative to collect and serially summate multiplicands generated in an i'th interleaved 
Montgomery multiplication operation thereby to generate a sum and to feed in said sum 
to an (i+1)'th Montgomery multiplication operation. 

19. A method according to claim 15 wherein said function comprises an additive 
inverse of a multiplicative inverse of N mod 2^k. 

20. A method according to claim 15 and also comprising computing J0 by resetting 
a and b to zero and setting S0 = 1. 

21. A method according to claim 15 and also comprising computing J0 by resetting 
a and b to zero and setting S0 = 1. 

22. A method according to claim 14 wherein said step of executing without 
previously computing comprises anticipating whether or not a modulus must be added to 
a multiplicative summation generated in the course of generating said product of integer 
inputs. 

23. A system according to claim 4 wherein said ALU comprises means for 
determining whether or not said product is or is not less than a modulus in which said 
modular multiplying device is operating, thereby to determine whether or not to reduce 
the size of said product. 

24. A system according to claim 4 and also comprising a pair of registers for storing 
a pair of integer inputs, said system being operative to multiply a respective pair of 
integer inputs, at least one of which exceeds the bit length of its respective register, 
without interleaving. 

25. A system according to claim 4 and also comprising a serial division device 
operative to receive a dividend of any bit length and a divisor of any bit length and to 
compute a quotient and a remainder. 

26. A system according to claim 24 and also comprising a serial division device 
operative to receive a dividend of any bit length and a divisor of any bit length and to 
compute a quotient and a remainder. 